Created by: Kellie Ottoboni, Rochelle Terman and Chris Krogslun (UC Berkeley)
Edited by: John S. Erickson (The Rensselaer IDEA)
Updated 17 Feb 2021 (JSE)
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)
Thus before you can even start on any sort of sophisticated analysis or plotting, you first must:
Historically there are two schools of thought within the R community…and at RPI!
- tidyverse uses syntax that’s unlike base R and is superfluous.
- tidyverse methods are easy to use, more readable than base R, and speed up the tidying process.
This tutorial shows you some of the tidyverse tools so you can make an informed decision about whether you want to continue to suffer through base R or enter the tidyverse.
See also: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf (Updated 17 Jan 2021)
For this unit, we’ll be working with the “Gapminder” dataset, which is an excerpt of the data available at Gapminder.org. For each of 142 countries, the data provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.
gapminder <- read.csv("../data/gapminder-FiveYearData.csv",
stringsAsFactors = TRUE)
head(gapminder)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
Another, more complete way to sanity-check our data types is str(gapminder):
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
So far, you’ve seen the basics of manipulating data frames, e.g. subsetting, merging, and basic calculations. For instance, we can use base R functions to calculate summary statistics across groups of observations, e.g. the mean GDP per capita within each continent:
## [1] 2193.755
## [1] 7136.11
## [1] 7902.15
But this isn’t ideal because it involves a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.
We want our code to more closely resemble our thinking, which is “get the means by continent”.
Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes. These functions will save you time by reducing repetition. As an added bonus, you might even find the dplyr grammar easier to read.
Here we’re going to cover six of the most commonly used functions as well as using pipes (%>%) to combine them.
- select()
- filter()
- group_by()
- summarize()
- mutate()
- arrange()
dplyr will be installed when installing tidyverse, or you can install it by itself with install.packages("dplyr").
Now let’s load the package with library(dplyr).
select()
Imagine that we just received the gapminder dataset, but are only interested in a few variables in it. We could use the select() function to keep only the columns corresponding to variables we select.
## year country gdpPercap
## 1 1952 Afghanistan 779.4453
## 2 1957 Afghanistan 820.8530
## 3 1962 Afghanistan 853.1007
## 4 1967 Afghanistan 836.1971
## 5 1972 Afghanistan 739.9811
## 6 1977 Afghanistan 786.1134
If we open up year_country_gdp, we’ll see that it only contains the year, country and gdpPercap. This is equivalent to the base R subsetting function:
## year country gdpPercap
## 1 1952 Afghanistan 779.4453
## 2 1957 Afghanistan 820.8530
## 3 1962 Afghanistan 853.1007
## 4 1967 Afghanistan 836.1971
## 5 1972 Afghanistan 739.9811
## 6 1977 Afghanistan 786.1134
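The two approaches can be compared side by side on a small stand-in table. This is only a sketch: the `toy` data frame and its values are made up for illustration and are not taken from the gapminder file.

```r
library(dplyr)

# Toy stand-in for gapminder (made-up values, for illustration only)
toy <- data.frame(
  country   = c("A", "B", "C"),
  year      = c(1952, 1957, 1962),
  pop       = c(100, 200, 300),
  gdpPercap = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# dplyr: name the columns to keep, in the order we want them
with_dplyr <- select(toy, year, country, gdpPercap)

# base R: index the same columns by name
with_base <- toy[, c("year", "country", "gdpPercap")]

# Both approaches should yield the same columns and values
all.equal(with_dplyr, with_base)
```

Note that select() takes bare column names, while base R subsetting needs a character vector of quoted names.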
But, as we will see, dplyr makes for much more readable, efficient code because of its pipe operator.
Above, we used what’s called ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes.
Pipes take the input on the left side of the %>% symbol and pass it in as the first argument to the function on the right side.
Since the pipe grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.
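As a minimal sketch of the idea, the snippet below writes the same select() call twice, once in 'normal' grammar and once with a pipe. The `toy` data frame is a hypothetical stand-in with made-up values.

```r
library(dplyr)

# Toy stand-in for gapminder (made-up values, for illustration only)
toy <- data.frame(
  country   = c("A", "B"),
  year      = c(1952, 1957),
  pop       = c(100, 200),
  gdpPercap = c(1.5, 2.5),
  stringsAsFactors = FALSE
)

# 'Normal' grammar: the data frame is the first argument
nested <- select(toy, year, country, gdpPercap)

# Pipe grammar: the data frame on the left of %>% becomes
# the first argument of select() on the right
piped <- toy %>% select(year, country, gdpPercap)

all.equal(nested, piped)
```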
We pass the gapminder data frame through the pipe symbol %>% to the select() function. In this case we don’t specify the data object to use in the call to select() since we’ve piped it in.
Fun Fact: There is a good chance you have encountered pipes before in the Linux shell. In R, a pipe symbol is %>% while in the shell it is |. But the concept is the same!
filter()
Now let’s say we’re only interested in African countries. We can combine select() and filter() to select only the observations where continent is Africa.
year_country_gdp_africa <- gapminder %>%
filter(continent == "Africa") %>%
select(year,country,gdpPercap)
First we pass the gapminder dataframe to the filter() function, then we pass the filtered version of the gapminder dataframe to the select() function.
Both the select() and filter() functions subset the data frame. The difference is that select() extracts certain columns, while filter() extracts certain rows.
Note: The order of operations is very important in this case. If we used select() first, filter() would not be able to find the variable continent since we would have removed it in the previous step.
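The effect of ordering can be checked on a small example. This is a sketch on a hypothetical `toy` table with made-up values; tryCatch() is used only so the failing pipeline does not stop the script.

```r
library(dplyr)

# Toy stand-in for gapminder (made-up gdpPercap values)
toy <- data.frame(
  country   = c("Algeria", "Angola", "Japan"),
  continent = c("Africa", "Africa", "Asia"),
  gdpPercap = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# filter() first: continent is still present when the rows are chosen
ok <- toy %>%
  filter(continent == "Africa") %>%
  select(country, gdpPercap)

# select() first: continent has already been dropped, so filter() fails
bad <- tryCatch(
  toy %>%
    select(country, gdpPercap) %>%
    filter(continent == "Africa"),
  error = function(e) "filter() could not find continent"
)
```

Here `ok` holds the two African rows, while `bad` holds the fallback string returned by the error handler.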
A common task you’ll encounter when working with data is running calculations on different groups within the data. For example, what if we wanted to calculate the mean GDP per capita for each continent?
In base R, you would run the mean() function for each subset of data:
## [1] 2193.755
## [1] 7136.11
## [1] 7902.15
## [1] 14469.48
## [1] 18621.61
That’s a lot of repetition! To make matters worse, what if we wanted to then add these values to our original data frame as a new column? We would have to write something like this:
gapminder$mean.continent.GDP <- NA # Initialize a new column
# Write the values into the new column, for each continent
gapminder$mean.continent.GDP[gapminder$continent == "Africa"] <- mean(gapminder$gdpPercap[gapminder$continent == "Africa"])
gapminder$mean.continent.GDP[gapminder$continent == "Americas"] <- mean(gapminder$gdpPercap[gapminder$continent == "Americas"])
gapminder$mean.continent.GDP[gapminder$continent == "Asia"] <- mean(gapminder$gdpPercap[gapminder$continent == "Asia"])
gapminder$mean.continent.GDP[gapminder$continent == "Europe"] <- mean(gapminder$gdpPercap[gapminder$continent == "Europe"])
gapminder$mean.continent.GDP[gapminder$continent == "Oceania"] <- mean(gapminder$gdpPercap[gapminder$continent == "Oceania"])
You can see how this can get pretty tedious, especially if we want to calculate more complicated or refined statistics. We could use loops or apply functions, but these can be difficult, slow, or error-prone.
The abstract problem we’re encountering here is known as split-apply-combine:
We want to split our data into groups (in this case continents), apply some calculations on each group, then combine the results together afterwards.
There are ways to do split-apply-combine operations using the apply() family of functions, but those are error prone and messy.
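To make the three steps explicit, here is one possible base R sketch using split() and sapply(). The `toy` data frame and its values are made up for illustration, and this is only one of several ways to write it.

```r
# Toy stand-in for gapminder (made-up values, for illustration only)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdpPercap = c(10, 20, 30, 50),
  stringsAsFactors = FALSE
)

# split: one vector of gdpPercap values per continent
pieces <- split(toy$gdpPercap, toy$continent)

# apply: compute the mean of each piece
means <- sapply(pieces, mean)

# combine: put the per-group results back into one data frame
result <- data.frame(
  continent      = names(means),
  mean_gdpPercap = unname(means),
  stringsAsFactors = FALSE
)
```

Even in this small case, the grouping variable has to be carried around by hand between the steps, which is where mistakes tend to creep in.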
Luckily, dplyr offers a much cleaner solution to this problem.
group_by()
We’ve already seen how filter() can help us select observations that meet certain criteria (in the above: continent == "Africa"). More helpful, however, is the group_by() function, which will essentially use every unique criterion that we could have used in filter().
A grouped_df can be thought of as a list where each item in the list is a data.frame that contains only the rows corresponding to a particular value of continent (at least in the example above).
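A quick way to inspect what group_by() returns is sketched below, using a hypothetical `toy` table with made-up values. The rows themselves are unchanged; only grouping metadata is added.

```r
library(dplyr)

# Toy stand-in for gapminder (made-up values, for illustration only)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdpPercap = c(10, 20, 30, 50),
  stringsAsFactors = FALSE
)

grouped <- toy %>% group_by(continent)

class(grouped)        # the class vector includes "grouped_df"
n_groups(grouped)     # number of distinct continents
group_keys(grouped)   # one row per group, showing the grouping values
nrow(grouped)         # same number of rows as the ungrouped data
```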
summarize()
group_by() on its own is not particularly interesting. It’s much more exciting when used in conjunction with the summarize() function, which allows us to create new variable(s) by applying transformations to variables in each of the continent-specific data frames.
In other words, when using the group_by() function, we split our original dataframe into multiple pieces, which we then apply summary functions to (e.g. mean() or sd()) within summarize(). The output is a new dataframe reduced in size, with one row per group.
gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
head(gdp_bycontinents)
## # A tibble: 5 x 2
## continent mean_gdpPercap
## <fct> <dbl>
## 1 Africa 2194.
## 2 Americas 7136.
## 3 Asia 7902.
## 4 Europe 14469.
## 5 Oceania 18622.
That allowed us to calculate the mean gdpPercap for each continent. But it gets even better – the function group_by() allows us to group by multiple variables. Let’s group by year and continent.
gdp_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
head(gdp_bycontinents_byyear)
## # A tibble: 6 x 3
## # Groups: continent [1]
## continent year mean_gdpPercap
## <fct> <int> <dbl>
## 1 Africa 1952 1253.
## 2 Africa 1957 1385.
## 3 Africa 1962 1598.
## 4 Africa 1967 2050.
## 5 Africa 1972 2340.
## 6 Africa 1977 2586.
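The message about `.groups` refers to the grouping that remains on the result. The sketch below, on a hypothetical `toy` table with made-up values, compares the default behaviour with `.groups = "drop"`; it assumes dplyr 1.0.0 or later, where the `.groups` argument is available.

```r
library(dplyr)

# Toy stand-in for gapminder (made-up values, for illustration only)
toy <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Asia", "Asia", "Asia"),
  year      = c(1952, 1952, 1957, 1952, 1952, 1957),
  gdpPercap = c(10, 20, 30, 40, 60, 80),
  stringsAsFactors = FALSE
)

# Default: the last grouping variable (year) is peeled off,
# so the result is still grouped by continent
by_default <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

# .groups = "drop": the result carries no grouping at all
dropped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

group_vars(by_default)   # remaining grouping variables
group_vars(dropped)      # empty: no grouping left
```

Dropping the groups can be a sensible choice when later steps should operate on the whole table rather than per continent.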
That is already quite powerful, but it gets even better! You’re not limited to defining one new variable in summarize().
gdp_pop_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
head(gdp_pop_bycontinents_byyear)
## # A tibble: 6 x 6
## # Groups: continent [1]
## continent year mean_gdpPercap sd_gdpPercap mean_pop sd_pop
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 1952 1253. 983. 4570010. 6317450.
## 2 Africa 1957 1385. 1135. 5093033. 7076042.
## 3 Africa 1962 1598. 1462. 5702247. 7957545.
## 4 Africa 1967 2050. 2848. 6447875. 8985505.
## 5 Africa 1972 2340. 3287. 7305376. 10130833.
## 6 Africa 1977 2586. 4142. 8328097. 11585184.
mutate()
What if we wanted to extend our original data frame with these values instead of creating a new object? For this, we can use the mutate() function, which is similar to summarize() except that it adds new variables to the same dataframe that you pass into it.
gapminder_with_extra_vars <- gapminder %>%
group_by(continent, year) %>%
mutate(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
head(gapminder_with_extra_vars)
## # A tibble: 6 x 10
## # Groups: continent, year [6]
## country year pop continent lifeExp gdpPercap mean_gdpPercap sd_gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… 1952 8.43e6 Asia 28.8 779. 5195. 18635.
## 2 Afghan… 1957 9.24e6 Asia 30.3 821. 5788. 19507.
## 3 Afghan… 1962 1.03e7 Asia 32.0 853. 5729. 16416.
## 4 Afghan… 1967 1.15e7 Asia 34.0 836. 5971. 14063.
## 5 Afghan… 1972 1.31e7 Asia 36.1 740. 8187. 19088.
## 6 Afghan… 1977 1.49e7 Asia 38.4 786. 7791. 11816.
## # … with 2 more variables: mean_pop <dbl>, sd_pop <dbl>
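The difference between summarize() and mutate() on grouped data is easiest to see by counting rows. A minimal sketch on a hypothetical `toy` table with made-up values:

```r
library(dplyr)

# Toy stand-in for gapminder (made-up values, for illustration only)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdpPercap = c(10, 20, 30, 50),
  stringsAsFactors = FALSE
)

# summarize(): collapses each group to a single row
summarized <- toy %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

# mutate(): keeps every row and repeats the group mean within each group
mutated <- toy %>%
  group_by(continent) %>%
  mutate(mean_gdpPercap = mean(gdpPercap))

nrow(summarized)   # one row per continent
nrow(mutated)      # same number of rows as toy
```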
We can also use mutate() to create new variables prior to (or even after) summarizing information.
gdp_pop_bycontinents_byyear <- gapminder %>%
mutate(gdp_billion = gdpPercap*pop/10^9) %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop),
mean_gdp_billion = mean(gdp_billion),
sd_gdp_billion = sd(gdp_billion))
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
head(gdp_pop_bycontinents_byyear)
## # A tibble: 6 x 8
## # Groups: continent [1]
## continent year mean_gdpPercap sd_gdpPercap mean_pop sd_pop mean_gdp_billion
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 1952 1253. 983. 4570010. 6.32e6 5.99
## 2 Africa 1957 1385. 1135. 5093033. 7.08e6 7.36
## 3 Africa 1962 1598. 1462. 5702247. 7.96e6 8.78
## 4 Africa 1967 2050. 2848. 6447875. 8.99e6 11.4
## 5 Africa 1972 2340. 3287. 7305376. 1.01e7 15.1
## 6 Africa 1977 2586. 4142. 8328097. 1.16e7 18.7
## # … with 1 more variable: sd_gdp_billion <dbl>
arrange()
As a last step, let’s say we want to sort the rows in our data frame according to values in a certain column. We can use the arrange() function to do this. For instance, let’s organize our rows by year (recent first), and then by continent.
gapminder_with_extra_vars <- gapminder %>%
group_by(continent, year) %>%
mutate(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop)) %>%
arrange(desc(year), continent)
head(gapminder_with_extra_vars)
## # A tibble: 6 x 10
## # Groups: continent, year [1]
## country year pop continent lifeExp gdpPercap mean_gdpPercap sd_gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Algeria 2007 3.33e7 Africa 72.3 6223. 3089. 3618.
## 2 Angola 2007 1.24e7 Africa 42.7 4797. 3089. 3618.
## 3 Benin 2007 8.08e6 Africa 56.7 1441. 3089. 3618.
## 4 Botswa… 2007 1.64e6 Africa 50.7 12570. 3089. 3618.
## 5 Burkin… 2007 1.43e7 Africa 52.3 1217. 3089. 3618.
## 6 Burundi 2007 8.39e6 Africa 49.6 430. 3089. 3618.
## # … with 2 more variables: mean_pop <dbl>, sd_pop <dbl>
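The sorting step on its own can be sketched with a few rows. The `toy` data frame below is hypothetical and its values are made up; desc() reverses the order for year, and continent breaks ties in ascending order.

```r
library(dplyr)

# Toy stand-in for gapminder (made-up values, deliberately unsorted)
toy <- data.frame(
  continent = c("Asia", "Africa", "Asia", "Africa"),
  year      = c(1952, 2007, 2007, 1952),
  gdpPercap = c(30, 20, 50, 10),
  stringsAsFactors = FALSE
)

# Most recent year first, then continents alphabetically within each year
sorted <- toy %>% arrange(desc(year), continent)
sorted
```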
# without pipes:
gapminder_with_extra_vars <- arrange(
mutate(
group_by(gapminder, continent, year),
mean_gdpPercap = mean(gdpPercap)
),
desc(year), continent)
Even before we conduct analysis or calculations, we need to put our data into the correct format. The goal here is to rearrange a messy dataset into one that is tidy.
The two most important properties of tidy data are:
- Each column is a variable.
- Each row is an observation.
Tidy data is easier to work with, because you have a consistent way of referring to variables (as column names) and observations (as row indices). It then becomes easy to manipulate, visualize, and model.
For more on the concept of tidy data, read Hadley Wickham’s paper “Tidy Data”.
“Tidy datasets are all alike but every messy dataset is messy in its own way.” – Hadley Wickham
Tabular datasets can be arranged in many ways. For instance, consider the data below. Each data set displays information on heart rate observed in individuals across 3 different time periods. But the data are organized differently in each table.
wide <- data.frame(
name = c("Wilbur", "Petunia", "Gregory"),
time1 = c(67, 80, 64),
time2 = c(56, 90, 50),
time3 = c(70, 67, 101)
)
wide
## name time1 time2 time3
## 1 Wilbur 67 56 70
## 2 Petunia 80 90 67
## 3 Gregory 64 50 101
long <- data.frame(
name = c("Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory"),
time = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
heartrate = c(67, 80, 64, 56, 90, 50, 70, 67, 101)
)
long
## name time heartrate
## 1 Wilbur 1 67
## 2 Petunia 1 80
## 3 Gregory 1 64
## 4 Wilbur 2 56
## 5 Petunia 2 90
## 6 Gregory 2 50
## 7 Wilbur 3 70
## 8 Petunia 3 67
## 9 Gregory 3 101
Question: Which one of these do you think is the tidy format?
Answer: The first dataframe (the “wide” one) would not be considered tidy because values (i.e., heartrate) are spread across multiple columns.
We often refer to these different structures as “long” vs. “wide” formats. In the “long” format, you usually have 1 column for the observed variable and the other columns are ID variables.
For the “wide” format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). In the above case, we had the same kind of data (heart rate) entered across 3 different columns, corresponding to three different time periods.
You may find that data entry is simpler in the “wide” format, and some programs and functions may prefer it. However, many of R’s functions have been designed assuming you have “long” format data.
gapminder Data
Let’s revisit the structure of our gapminder dataframe:
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
Question: Is this data frame wide or long?
Answer: This data frame is somewhere in between the purely ‘long’ and ‘wide’ formats:
- three ID variables (continent, country, year)
- three observation variables (pop, lifeExp, gdpPercap)
Despite not having ALL observations in one column, this intermediate format makes sense given that all three observation variables have different units. As we have seen, many of the functions in R are often vector based, and you usually do not want to do mathematical operations on values with different units.
On the other hand, there are some instances in which a purely long or wide format is ideal (e.g. plotting). Likewise, sometimes you’ll get data on your desk that is poorly organized, and you’ll need to reshape it.
Thankfully, the tidyr package will help you efficiently transform your data regardless of original format.
gather()
Until now, we’ve been using the nicely formatted original gapminder dataset. This dataset is not quite wide and not quite long – it’s something in the middle, but ‘real’ data (i.e. our own research data) will never be so well organized. Here let’s start with the wide format version of the gapminder dataset.
## continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
## 1 Africa Algeria 2449.0082 3013.9760 2550.8169
## 2 Africa Angola 3520.6103 3827.9405 4269.2767
## 3 Africa Benin 1062.7522 959.6011 949.4991
## 4 Africa Botswana 851.2411 918.2325 983.6540
## 5 Africa Burkina Faso 543.2552 617.1835 722.5120
## 6 Africa Burundi 339.2965 379.5646 355.2032
## gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987
## 1 3246.9918 4182.6638 4910.4168 5745.1602 5681.3585
## 2 5522.7764 5473.2880 3008.6474 2756.9537 2430.2083
## 3 1035.8314 1085.7969 1029.1613 1277.8976 1225.8560
## 4 1214.7093 2263.6111 3214.8578 4551.1421 6205.8839
## 5 794.8266 854.7360 743.3870 807.1986 912.0631
## 6 412.9775 464.0995 556.1033 559.6032 621.8188
## gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007 lifeExp_1952
## 1 5023.2166 4797.2951 5288.0404 6223.3675 43.077
## 2 2627.8457 2277.1409 2773.2873 4797.2313 30.015
## 3 1191.2077 1232.9753 1372.8779 1441.2849 38.223
## 4 7954.1116 8647.1423 11003.6051 12569.8518 47.622
## 5 931.7528 946.2950 1037.6452 1217.0330 31.975
## 6 631.6999 463.1151 446.4035 430.0707 39.031
## lifeExp_1957 lifeExp_1962 lifeExp_1967 lifeExp_1972 lifeExp_1977 lifeExp_1982
## 1 45.685 48.303 51.407 54.518 58.014 61.368
## 2 31.999 34.000 35.985 37.928 39.483 39.942
## 3 40.358 42.618 44.885 47.014 49.190 50.904
## 4 49.618 51.520 53.298 56.024 59.319 61.484
## 5 34.906 37.814 40.697 43.591 46.137 48.122
## 6 40.533 42.045 43.548 44.057 45.910 47.471
## lifeExp_1987 lifeExp_1992 lifeExp_1997 lifeExp_2002 lifeExp_2007 pop_1952
## 1 65.799 67.744 69.152 70.994 72.301 9279525
## 2 39.906 40.647 40.963 41.003 42.731 4232095
## 3 52.337 53.919 54.777 54.406 56.728 1738315
## 4 63.622 62.745 52.556 46.634 50.728 442308
## 5 49.557 50.260 50.324 50.650 52.295 4469979
## 6 48.211 44.736 45.326 47.360 49.580 2445618
## pop_1957 pop_1962 pop_1967 pop_1972 pop_1977 pop_1982 pop_1987 pop_1992
## 1 10270856 11000948 12760499 14760787 17152804 20033753 23254956 26298373
## 2 4561361 4826015 5247469 5894858 6162675 7016384 7874230 8735988
## 3 1925173 2151895 2427334 2761407 3168267 3641603 4243788 4981671
## 4 474639 512764 553541 619351 781472 970347 1151184 1342614
## 5 4713416 4919632 5127935 5433886 5889574 6634596 7586551 8878303
## 6 2667518 2961915 3330989 3529983 3834415 4580410 5126023 5809236
## pop_1997 pop_2002 pop_2007
## 1 29072015 31287142 33333216
## 2 9875024 10866106 12420476
## 3 6066080 7026113 8078314
## 4 1536536 1630347 1639131
## 5 10352843 12251209 14326203
## 6 6121610 7021078 8390505
The first step towards getting our nice intermediate data format is to convert from the wide to the long format. The function gather() will ‘gather’ the observation variables into a single variable. This is sometimes called “melting” your data, because it melts the table from wide to long. Those data will be melted into two variables: one for the variable names, and the other for the variable values.
## continent country obstype_year obs_values
## 1 Africa Algeria gdpPercap_1952 2449.0082
## 2 Africa Angola gdpPercap_1952 3520.6103
## 3 Africa Benin gdpPercap_1952 1062.7522
## 4 Africa Botswana gdpPercap_1952 851.2411
## 5 Africa Burkina Faso gdpPercap_1952 543.2552
## 6 Africa Burundi gdpPercap_1952 339.2965
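The same kind of gather() call can be tried on the small heart-rate table from earlier, which is easier to inspect than the 38-column gapminder file. This is a sketch: the column names `time` and `heartrate` are chosen here to match the long table shown above, and the columns to gather are given by position.

```r
library(tidyr)

# The wide heart-rate table from earlier in this tutorial
wide <- data.frame(
  name  = c("Wilbur", "Petunia", "Gregory"),
  time1 = c(67, 80, 64),
  time2 = c(56, 90, 50),
  time3 = c(70, 67, 101),
  stringsAsFactors = FALSE
)

# Gather columns 2 through 4 into two new columns:
#   time      holds the old column names (time1, time2, time3)
#   heartrate holds the values that were spread across those columns
long <- gather(wide, time, heartrate, 2:4)
long
```

Column 1 (`name`) is left out of the gathered range, so it is kept as an ID variable and repeated for each measurement.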
Notice that we put three arguments into the gather() function:
- the name of the new column that will hold the old column names (obstype_year),
- the name of the new column that will hold the observation values (obs_values),
- the indices of the old observation columns (3:38, signalling columns 3 through 38) that we want to gather into one variable.
Notice that we don’t want to melt down columns 1 and 2; these are considered “ID” variables.
select()
If there are many columns or they’re named in a consistent pattern, we might not want to select them using the column numbers. Sometimes it’s easier to use some information contained in the names themselves.
We can select variables using:
- x:z to select all variables between x and z
- -y to exclude y
- starts_with(x, ignore.case = TRUE): all names that start with x
- ends_with(x, ignore.case = TRUE): all names that end with x
- contains(x, ignore.case = TRUE): all names that contain x
See the select() function in dplyr for more options.
For instance, here we do the same gather operation with (1) the starts_with function, and (2) the - operator:
# with the starts_with() function
gap_long <- gap_wide %>%
gather(obstype_year, obs_values, starts_with('pop'),
starts_with('lifeExp'), starts_with('gdpPercap'))
head(gap_long)
## continent country obstype_year obs_values
## 1 Africa Algeria pop_1952 9279525
## 2 Africa Angola pop_1952 4232095
## 3 Africa Benin pop_1952 1738315
## 4 Africa Botswana pop_1952 442308
## 5 Africa Burkina Faso pop_1952 4469979
## 6 Africa Burundi pop_1952 2445618
# with the - operator
gap_long <- gap_wide %>%
gather(obstype_year, obs_values, -continent, -country)
head(gap_long)
## continent country obstype_year obs_values
## 1 Africa Algeria gdpPercap_1952 2449.0082
## 2 Africa Angola gdpPercap_1952 3520.6103
## 3 Africa Benin gdpPercap_1952 1062.7522
## 4 Africa Botswana gdpPercap_1952 851.2411
## 5 Africa Burkina Faso gdpPercap_1952 543.2552
## 6 Africa Burundi gdpPercap_1952 339.2965
However you choose to do it, notice that the output collapses all of the measure variables into two columns: one containing the new ID variable, the other containing the observation value for that row.
separate()
You’ll notice that in our long dataset, obstype_year actually contains 2 pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year.
We can use the separate() function to split the character strings into multiple variables:
gap_long_sep <- gap_long %>%
separate(obstype_year, into = c('obs_type','year'), sep = "_") %>%
mutate(year = as.integer(year))
head(gap_long_sep)
## continent country obs_type year obs_values
## 1 Africa Algeria gdpPercap 1952 2449.0082
## 2 Africa Angola gdpPercap 1952 3520.6103
## 3 Africa Benin gdpPercap 1952 1062.7522
## 4 Africa Botswana gdpPercap 1952 851.2411
## 5 Africa Burkina Faso gdpPercap 1952 543.2552
## 6 Africa Burundi gdpPercap 1952 339.2965
If you didn’t use tidyr to do this, you’d have to use the strsplit function and use multiple lines of code to replace the column in gap_long with two new columns. This solution is much cleaner.
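The splitting step can be checked in isolation on a few rows. In this sketch, `toy_long` is a hypothetical table with made-up values whose `obstype_year` column follows the same `type_year` naming pattern as above.

```r
library(dplyr)
library(tidyr)

# Toy long table (made-up values, for illustration only)
toy_long <- data.frame(
  obstype_year = c("pop_1952", "lifeExp_1952", "gdpPercap_1957"),
  obs_values   = c(100, 40, 2.5),
  stringsAsFactors = FALSE
)

# Split obstype_year at the underscore into two new columns,
# then convert year from character to integer
toy_sep <- toy_long %>%
  separate(obstype_year, into = c("obs_type", "year"), sep = "_") %>%
  mutate(year = as.integer(year))

str(toy_sep)
```

separate() returns the new columns as character vectors by default, which is why the explicit conversion of `year` is needed.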
The opposite of gather() is spread(). It spreads our observation variables back out to make a wider table. We can use this function to spread gap_long_sep back to the original “medium” format.
## continent country year gdpPercap lifeExp pop
## 1 Africa Algeria 1952 2449.008 43.077 9279525
## 2 Africa Algeria 1957 3013.976 45.685 10270856
## 3 Africa Algeria 1962 2550.817 48.303 11000948
## 4 Africa Algeria 1967 3246.992 51.407 12760499
## 5 Africa Algeria 1972 4182.664 54.518 14760787
## 6 Africa Algeria 1977 4910.417 58.014 17152804
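A spread() call of this kind can be sketched on a few rows. The `toy_sep` table below is hypothetical, with made-up values; its `obs_type` column supplies the new column names and `obs_values` supplies the cell contents.

```r
library(tidyr)

# Toy long table (made-up values, for illustration only)
toy_sep <- data.frame(
  country    = c("A", "A", "B", "B"),
  obs_type   = c("lifeExp", "pop", "lifeExp", "pop"),
  obs_values = c(40, 100, 50, 200),
  stringsAsFactors = FALSE
)

# Each distinct value of obs_type becomes its own column,
# filled with the matching obs_values
toy_medium <- spread(toy_sep, obs_type, obs_values)
toy_medium
```

Recent tidyr releases recommend pivot_longer() and pivot_wider() for new code, though gather() and spread() continue to work as shown in this tutorial.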
All we need are a few quick fixes to make this dataset identical to the original gapminder dataset:
## continent country year gdpPercap lifeExp pop
## 1 Africa Algeria 1952 2449.008 43.077 9279525
## 2 Africa Algeria 1957 3013.976 45.685 10270856
## 3 Africa Algeria 1962 2550.817 48.303 11000948
## 4 Africa Algeria 1967 3246.992 51.407 12760499
## 5 Africa Algeria 1972 4182.664 54.518 14760787
## 6 Africa Algeria 1977 4910.417 58.014 17152804
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
## country year pop continent lifeExp gdpPercap
## 1 Algeria 1952 9279525 Africa 43.077 2449.008
## 2 Algeria 1957 10270856 Africa 45.685 3013.976
## 3 Algeria 1962 11000948 Africa 48.303 2550.817
## 4 Algeria 1967 12760499 Africa 51.407 3246.992
## 5 Algeria 1972 14760787 Africa 54.518 4182.664
## 6 Algeria 1977 17152804 Africa 58.014 4910.417
# arrange by country, continent, and year
gap_medium <- gap_medium %>%
arrange(country,continent,year)
head(gap_medium)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
dplyr and tidyr have many more functions to help you wrangle and manipulate your data. See the Data Wrangling Cheat Sheet for more.
There are some other useful packages in the tidyverse:
- ggplot2 for plotting (I’ll cover this in module 8)
- readr and haven for reading in data with structure other than csv
- stringr, lubridate, forcats for manipulating strings, dates, and factors, respectively
Use dplyr to create a data frame containing the median lifeExp for each continent.
Use dplyr to add a column to the gapminder dataset that contains the total population of the continent of each observation in a given year. For example, if the first observation is Afghanistan in 1952, the new column would contain the population of Asia in 1952.
Use dplyr to: add a column called gdpPercap_diff that contains the difference between the observation’s gdpPercap and the mean gdpPercap of the continent in that year. Arrange the dataframe by the column you just created, in descending order (so that the relatively richest country/years are listed first)
hint: You might have to ungroup() before you arrange().
Select only the country, year, and gdpPercap_diff columns. Use tidyr to put the result in wide format so that countries are rows and years are columns.